Global Happiness Analysis¶

DATA 604_L01¶

Group 10¶

  • Bo Li (30212597)
  • Brian Ho (30222881)
  • Ethan Mah (30009425)

1. Introduction¶

Our project investigates the attributes that contribute to the successful relationship between a country and its citizens, focusing on social well-being and mental health. In this project, we aim to investigate how a country's happiness score, a subjective notion, can be influenced by objective factors such as country population, density, and education inequality. We will also analyze the correlation between a country's population and educational inequality. This project is also important as it pinpoints specific areas where countries can improve based on the perspective of their citizens. Effective use of this information can lead to improved living conditions, and social cohesion, and promote sustainable development.

At the core of this project, our objective is to provide the public with valuable insights into the well-being of nations and their citizens, helping policymakers understand the factors that contribute to an individual’s happiness. Such insights can identify trends, disparities, and successful policies that can be shared and adapted by other countries to improve their citizen’s quality of life. In addition, this information can be extremely helpful to those who are considering moving to a different country and want to evaluate the different kinds of factors that promote liveability. For example, someone who comes from a country with a low social support score may have a higher priority in that attribute and want to move to a different country to meet that specific need.

2. Datasets¶

2.1 The source and licensing of the dataset:¶

Our chosen datasets originate from the World Happiness Report, Worldometer, and the Human Development Report, which explore social well-being in terms of personal happiness, a country’s population, and a country’s education inequality respectively. These cleaned CSV datasets are organized in a tabular format and are free from unintentional missing or duplicate values. These datasets were directly shared with and downloaded from Kaggle.com, and are therefore available for public use. We will conjoin these tables together to uncover underlying relationships and patterns between fields that was not possible if these datasets were alone. To complete this task, we will first:

  1. Read the CSV's as a dataframe
  2. Connect to a mutual database
  3. Write the prior dataframe's into a table by importing dataframe information
  4. Test each database table - dataframe separately

2.2 Read CSV's and establish connection to database:¶

In [3]:
# Libraries used:

import pandas as pd
import sqlalchemy as sq
import requests
In [4]:
### Read CSV as dataframe tables: ###
# Multiples of the same datasets were uploaded as each member did their investigation
#  separately and differently to get our final results.

## 1. Happiness Report dataset: ##

# Happiness table used for Question 1.
url = "https://github.com/boli3ucalgary/604/raw/main/DataForTable2.1WHR2023.xls"
response = requests.get(url)
with open("DataForTable2.1WHR2023.xls", "wb") as file:
    file.write(response.content)
happyreport = pd.read_excel("DataForTable2.1WHR2023.xls")
%pip install openpyxl

# Happiness table used for Question 2.
score = pd.read_csv("WHR20_DataForFigure2.1.csv")

# Happiness table used for Question 3.
happiness = pd.read_csv("WHR20_DataForFigure2.1.csv")


## 2. Population 2020 dataset ##

# Population table used for Question 2.
populations = pd.read_csv("population_by_country_2020.csv", na_values= "N.A.")
populations.columns = ["Country", "Population", "Growth", "Net Change", "Density", "Area", "Migrants", "Fertility Rate", "Median Age", "Urban Population", "World Share"]

# Population table used for Question 3.
population2020 = pd.read_csv("population_by_country_2020.csv")
# Needed to change column name as a Programming Error would occur when calling for this dataset
population2020.rename(columns = {'Urban Pop %':'Urban Pop Percentage'}, inplace = True)



## 3. Inequality Education dataset: ##

# Education Inequality table used for Question 1.
Inequalityreport = pd.read_csv("Inequality in Education.csv")

# Education Inequality table used for Question 3.
inequality_ed = pd.read_csv("Inequality in Education.csv")


# Connect to your database; include a cell at the bottom of this notebook to dispose of your engine object
sq.__version__

engine = sq.create_engine('mysql+mysqlconnector://l01-10:Ois04E11CcdNr@datasciencedb2.ucalgary.ca/l01-10')
Requirement already satisfied: openpyxl in /opt/conda/lib/python3.10/site-packages (3.0.10)
Requirement already satisfied: et_xmlfile in /opt/conda/lib/python3.10/site-packages (from openpyxl) (1.0.1)
Note: you may need to restart the kernel to use updated packages.
In [4]:
# Run Engine - push data tables to the cloud (only need to do once)

happiness.to_sql('happiness', engine, if_exists='replace' )
Out[4]:
153
In [5]:
population2020.to_sql('population2020', engine , if_exists='replace')
Out[5]:
235
In [6]:
inequality_ed.to_sql('education', engine , if_exists='replace')
Out[6]:
195
In [7]:
happyreport.to_sql('happy', engine, if_exists='replace')
Out[7]:
2199
In [8]:
Inequalityreport.to_sql('inequality', engine, if_exists='replace')
Out[8]:
195
In [10]:
score.to_sql('happiness_db', engine, if_exists='replace')
Out[10]:
153
In [11]:
populations.to_sql('population_db', engine, if_exists='replace')
Out[11]:
235

2.3 Data Cleaning Challanges:¶

  • We needed to change the name of the "Urban Pop %" column in the population dataset to avoid a programming error that would occur when running a query. After the name change, we were able to run the query without any issues.
  • While combining datasets using the "Country" column for matching, a notable challenge arose. The difficulty came from certain countries sharing common words in their names. For example, in the happiness dataset, countries such as China, Hong Kong S.A.R. of China, and Taiwan Province of China all have the word "China" in their names. This similarity posed a potential issue during the join operation, particularly when using the LIKE operator to match rows with similar key values. It could lead to the unintentional duplication of rows. Consequently, the process of joining datasets demands a more careful examination to ensure the accuracy of the matching operation.

2.4 Individual Datasets:¶

  1. World Happiness Report – 2020: A global survey that assess and ranks countries based on factors like income, social support, and life expectancy, with the goal to measure and understand subjective well-being.
  • The World Happiness Report for 2020 contains 153 data instances, highlighting each country’s happiness score and the 18 field components that contributed to that final grade. Each data item has a predefined nature and has a consistent data type throughout each column. The dataset captures a wide range of features, spanning from demographic information such as the region of the country, to average scores relating to one’s well-being, for example, life expectancy and social support. The 20 columns about this dataset focus largely on the following factors: Country, name, regional indicator, GDP per capita, healthy life expectancy, social support, freedom to make life choices, generosity, perception of corruption, and residual error. Our study will focus on the columns, country name, and ladder score.
In [5]:
# Call table of happiness database

happiness_df = pd.read_sql_table("happiness", engine)
happiness_df.head(10)
Out[5]:
index Country name Regional indicator Ladder score Standard error of ladder score upperwhisker lowerwhisker Logged GDP per capita Social support Healthy life expectancy ... Generosity Perceptions of corruption Ladder score in Dystopia Explained by: Log GDP per capita Explained by: Social support Explained by: Healthy life expectancy Explained by: Freedom to make life choices Explained by: Generosity Explained by: Perceptions of corruption Dystopia + residual
0 0 Finland Western Europe 7.8087 0.031156 7.869766 7.747634 10.639267 0.954330 71.900825 ... -0.059482 0.195445 1.972317 1.285190 1.499526 0.961271 0.662317 0.159670 0.477857 2.762835
1 1 Denmark Western Europe 7.6456 0.033492 7.711245 7.579955 10.774001 0.955991 72.402504 ... 0.066202 0.168489 1.972317 1.326949 1.503449 0.979333 0.665040 0.242793 0.495260 2.432741
2 2 Switzerland Western Europe 7.5599 0.035014 7.628528 7.491272 10.979933 0.942847 74.102448 ... 0.105911 0.303728 1.972317 1.390774 1.472403 1.040533 0.628954 0.269056 0.407946 2.350267
3 3 Iceland Western Europe 7.5045 0.059616 7.621347 7.387653 10.772559 0.974670 73.000000 ... 0.246944 0.711710 1.972317 1.326502 1.547567 1.000843 0.661981 0.362330 0.144541 2.460688
4 4 Norway Western Europe 7.4880 0.034837 7.556281 7.419719 11.087804 0.952487 73.200783 ... 0.134533 0.263218 1.972317 1.424207 1.495173 1.008072 0.670201 0.287985 0.434101 2.168266
5 5 Netherlands Western Europe 7.4489 0.027792 7.503372 7.394428 10.812712 0.939139 72.300919 ... 0.207612 0.364717 1.972317 1.338946 1.463646 0.975675 0.613626 0.336318 0.368570 2.352117
6 6 Sweden Western Europe 7.3535 0.036234 7.424519 7.282481 10.758794 0.926311 72.600769 ... 0.111615 0.250880 1.972317 1.322235 1.433348 0.986470 0.650298 0.272828 0.442066 2.246299
7 7 New Zealand North America and ANZ 7.2996 0.039465 7.376953 7.222248 10.500943 0.949119 73.202629 ... 0.191598 0.221139 1.972317 1.242318 1.487218 1.008138 0.646790 0.325726 0.461268 2.128108
8 8 Austria Western Europe 7.2942 0.033365 7.359596 7.228804 10.742824 0.928046 73.002502 ... 0.085429 0.499955 1.972317 1.317286 1.437445 1.000934 0.603369 0.255510 0.281256 2.398446
9 9 Luxembourg Western Europe 7.2375 0.030852 7.297970 7.177031 11.450681 0.906912 72.599998 ... -0.004621 0.367084 1.972317 1.536676 1.387528 0.986443 0.610137 0.195954 0.367041 2.153700

10 rows × 21 columns

  1. Population by Country – 2020: Contains population statistics for various countries during 2020.
  • The country population dataset for 2020 contains 235 data instances, exploring a unique country’s population and 9 other fields investigating its growth or decline. Each data item has a predefined nature and has a consistent data type throughout each column. The dataset captures a thorough investigation relating to a country’s population change by exploring the following fields: Country name, population (2020), yearly change, net change, density, land area, migrants, fertility rate median age, urban population percentage, and world share. For our project, we will focus on the columns: country name, population density, median age, percentage of urban population, and fertility rate.
In [13]:
# Call table of population database

population_df = pd.read_sql_table("population2020", engine)
population_df.head(10)
Out[13]:
index Country (or dependency) Population (2020) Yearly Change Net Change Density (P/Km²) Land Area (Km²) Migrants (net) Fert. Rate Med. Age Urban Pop Percentage World Share
0 0 China 1440297825 0.39 % 5540090 153 9388211 -348399.0 1.7 38 61 % 18.47 %
1 1 India 1382345085 0.99 % 13586631 464 2973190 -532687.0 2.2 28 35 % 17.70 %
2 2 United States 331341050 0.59 % 1937734 36 9147420 954806.0 1.8 38 83 % 4.25 %
3 3 Indonesia 274021604 1.07 % 2898047 151 1811570 -98955.0 2.3 30 56 % 3.51 %
4 4 Pakistan 221612785 2.00 % 4327022 287 770880 -233379.0 3.6 23 35 % 2.83 %
5 5 Brazil 212821986 0.72 % 1509890 25 8358140 21200.0 1.7 33 88 % 2.73 %
6 6 Nigeria 206984347 2.58 % 5175990 226 910770 -60000.0 5.4 18 52 % 2.64 %
7 7 Bangladesh 164972348 1.01 % 1643222 1265 130170 -369501.0 2.1 28 39 % 2.11 %
8 8 Russia 145945524 0.04 % 62206 9 16376870 182456.0 1.8 40 74 % 1.87 %
9 9 Mexico 129166028 1.06 % 1357224 66 1943950 -60000.0 2.1 29 84 % 1.65 %
  1. Education Inequality: This topic refers to the unequal distribution of academic resources in terms of school funding, experienced teachers, and class material, based on various factors like age, gender, income, etc. ("Teahcing about educational inequality," n.d.).
  • The Education Inequality dataset holds 195 data instances over 17 fields. Each data instance is a unique country and includes information such as Human development groups, UNDP developing regions, Human Development Index (HDI) ranks as of 2021, and Inequality in Education scores spanning from the years 2010 to 2021. Each data item has a predefined nature and a consistent data type throughout each column. Our study will use the country name, human development groups, human development index (HDI) rank, and education inequality for 2020 columns.
In [6]:
# Call table of education inequality database

inequality_ed_df = pd.read_sql_table("education", engine)
inequality_ed_df.head(10)
Out[6]:
index ISO3 Country Human Development Groups UNDP Developing Regions HDI Rank (2021) Inequality in Education (2010) Inequality in Education (2011) Inequality in Education (2012) Inequality in Education (2013) Inequality in Education (2014) Inequality in Education (2015) Inequality in Education (2016) Inequality in Education (2017) Inequality in Education (2018) Inequality in Education (2019) Inequality in Education (2020) Inequality in Education (2021)
0 0 AFG Afghanistan Low SA 180.0 42.809000 44.823380 44.823380 44.823380 44.823380 45.365170 45.365170 45.365170 45.365170 45.365170 45.365170 45.365170
1 1 AGO Angola Medium SSA 148.0 NaN NaN NaN NaN NaN 34.171440 34.171440 34.171440 34.171440 34.171440 34.171440 34.171440
2 2 ALB Albania High ECA 67.0 11.900000 11.900000 11.900000 11.900000 11.900000 11.900000 11.900000 12.333440 12.333440 12.333440 12.333440 12.333440
3 3 AND Andorra Very High None 40.0 15.160302 15.160302 15.160302 15.160302 9.965681 10.083815 10.008154 10.008154 10.008154 10.008154 10.008154 10.008154
4 4 ARE United Arab Emirates Very High AS 26.0 NaN NaN NaN NaN NaN NaN 18.241437 14.475335 12.634355 12.634355 12.634355 12.634355
5 5 ARG Argentina Very High LAC 47.0 6.914310 6.830900 6.430560 6.513360 6.192280 6.331310 6.151940 6.208430 5.951810 5.787290 5.787290 5.787290
6 6 ARM Armenia High ECA 85.0 3.685000 3.685000 3.685000 3.685000 3.685000 3.685000 2.934950 2.934950 2.934950 2.934950 2.934950 2.934950
7 7 ATG Antigua and Barbuda High LAC 71.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8 8 AUS Australia Very High None 5.0 2.746427 2.477821 2.475142 2.935539 2.832901 2.720964 2.672566 3.043233 3.056570 3.056570 3.056570 3.056570
9 9 AUT Austria Very High None 25.0 2.488360 3.667240 3.489860 4.297670 2.658110 2.607560 2.570640 2.999600 2.917450 2.917450 2.522410 2.522410

3. Guiding Questions¶

The following guiding questions will help determine if there are significant relationships between certain quantitative factors and the happiness of countries:

Is there a relationship between happier countries and inequality in terms of education?¶

Studies have found that obtaining an education, regardless of the level, increased one’s happiness as education tended to boost an individual’s self-esteem, and raise one’s potential income and chances of employment (Cunado & Gracia, 2011). Even though education typically increases one’s happiness regardless of the level one is enrolled in, researchers have reported a clear negative relationship between education inequality scores and happiness. This article does not specifically state that the discrepancy in education quality is the leading cause of unhappiness, but rather people tend to compare themselves with their more successful peers in areas like social status, income, and occupation. This ultimately leads to feelings of jealousy (Nikolaev, 2015). Therefore, can a country's education inequality level significantly impact the overall happiness of its citizens? Or do other factors play a more dominant role, making education inequality less influential to a country's overall happiness score?

Are there any relationships between population related factors and the happiness of a country?¶

Happiness, with a similar meaning as life satisfaction, is likely to be reflected by one’s quality of life, which comprises numerous influencing factors. Some studies have suggested the potential correlations between population-related factors and quality of life. For example, population density is found to decrease with the level of happiness, mainly due to the accompanied problems and inconveniences caused by densely populated environments, like overcrowding and traffic congestion (Belikow et al., 2021; Li & Kanazawa, 2016). Besides overall population size, the percentage of urban population is another parameter that affects a country’s happiness level, as urbanization generally tends to have a significant positive effect on an individual’s happiness (Lenzi & Perucca, 2022). While fertility is influenced by different individual and social factors, happiness has been found to be negatively associated with the number of children a woman gives birth to (Margolis & Myrskylä, 2011). Therefore, correlations seem to exist between happiness and population related factors, and we aim to identify them on a country basis using the selected datasets to determine if these factors can serve as robust indicators of a country’s happiness level.

Is there a relationship between a country's population and the amount of inequality a country has in terms of education?¶

This question investigates whether more populated countries show more academic inequality due to their population size or if education inequality exists due to other factors such as a nation’s improper allocation of resources to meet academic demand. Researchers have concluded that a person’s social class is a major predictor of academic success, especially during an individual’s earlier years (García & Weiss, 2017). Most of the time, the poor stay poor, while the rich get richer. Even though inequality within a country is inevitable, there are, however, some countries that have shown impressive strides in narrowing the education discrepancy among high- and low-income families. Finland, for example, is one of the many countries that reconstructed its education system where all children receive the same education stream until 16 years old, regardless of their income bracket, disabilities, or gender (Saunders, 2016).


4. Data Exploration¶

4.1 Guiding Question 1¶

4.1.1 Question Overview:¶

Is there a relationship between happier countries and inequality in terms of education?¶

Getting an education, no matter the level, is associated with increased happiness, improved self-esteem, and better employment and income prospects. However, studies reveal a negative correlation between education inequality and happiness. While education generally enhances happiness, the article suggests that the perceived discrepancy in success compared to peers, especially in areas like social status and income, can lead to feelings of jealousy. This idea prompts the question: Does a country's education inequality significantly impact overall happiness?

4.1.2 Data Wrangling Process:¶

In [7]:
# Load the Excel file 
#  (DataForTable2.1WHR2023.xls is the happiness dataset. However, it was downloaded and
#  obtained directly from The World Happiness Report website instead of Kaggle.)

data_df = pd.read_excel("DataForTable2.1WHR2023.xls")

# Initialize a list to store DataFrames
top_countries_data_list = []

# For each year, find the top ten countries based on "Life Ladder" and add to the list
for year in sorted(data_df['year'].unique()):
    top_countries_year = data_df[data_df['year'] == year].nlargest(10, 'Life Ladder')
    
    # Check if Canada is in the top countries for the year, if not add it
    if 'Canada' not in top_countries_year['Country name'].values:
        canada_data_year = data_df[(data_df['Country name'] == 'Canada') & (data_df['year'] == year)]
        top_countries_year = pd.concat([top_countries_year, canada_data_year])
    
    top_countries_data_list.append(top_countries_year)

# Concatenate all DataFrames in the list
top_countries_data = pd.concat(top_countries_data_list)
top_countries_data.head(5)
Out[7]:
Country name year Life Ladder Log GDP per capita Social support Healthy life expectancy at birth Freedom to make life choices Generosity Perceptions of corruption Positive affect Negative affect
505 Denmark 2005 8.018934 10.849012 0.972372 68.300003 0.971135 NaN 0.236522 0.776689 0.153672
1380 Netherlands 2005 7.463979 10.809070 0.947358 70.699997 0.901008 NaN 0.571342 0.700739 0.232795
327 Canada 2005 7.418048 10.707299 0.961552 70.500000 0.957306 0.248033 0.502681 0.783299 0.233278
1849 Sweden 2005 7.376316 10.724154 0.951470 71.000000 0.964395 NaN NaN 0.742480 0.150766
76 Australia 2005 7.340688 10.662058 0.967892 69.800003 0.934973 NaN 0.390416 0.769770 0.238012

We will first create a data frame concerning only the most "happiest" countries across multiple years. Canada will also be captured and measured against these countries. Doing this will identify which countries Canada can learn from in terms of improving their citizen's happiness.

In [8]:
from sqlalchemy import text

# update null value to 0 under columns in Inequality
update_statement = text("""
    UPDATE inequality
    SET 
        `Inequality in Education (2010)` = COALESCE(`Inequality in Education (2010)`, 0),
        `Inequality in Education (2011)` = COALESCE(`Inequality in Education (2011)`, 0),
        `Inequality in Education (2012)` = COALESCE(`Inequality in Education (2012)`, 0),
        `Inequality in Education (2013)` = COALESCE(`Inequality in Education (2013)`, 0),
        `Inequality in Education (2014)` = COALESCE(`Inequality in Education (2014)`, 0),
        `Inequality in Education (2015)` = COALESCE(`Inequality in Education (2015)`, 0),
        `Inequality in Education (2016)` = COALESCE(`Inequality in Education (2016)`, 0),
        `Inequality in Education (2017)` = COALESCE(`Inequality in Education (2017)`, 0),
        `Inequality in Education (2018)` = COALESCE(`Inequality in Education (2018)`, 0),
        `Inequality in Education (2019)` = COALESCE(`Inequality in Education (2019)`, 0),
        `Inequality in Education (2020)` = COALESCE(`Inequality in Education (2020)`, 0),
        `Inequality in Education (2021)` = COALESCE(`Inequality in Education (2021)`, 0);
""")

with engine.connect() as connection:
    # Use a transaction
    with connection.begin() as transaction:
        connection.execute(update_statement)

There appeared to be multiple NULL values captured in the Education Inequality dataset. Leaving these cell values the way they are will disturb query output later on in the project. Therefore, it is better to clean them now to diminish potential issues. To do this, we will change any NULL values appearing in the education inequality score fields to 0. This way, we can still aggregate these data points if needed.

In [9]:
query_2 = pd.read_sql_query("SELECT * FROM happy WHERE year = 2020 ORDER BY 'Life Ladder' DESC;", engine)

query_2.head(20)
Out[9]:
index Country name year Life Ladder Log GDP per capita Social support Healthy life expectancy at birth Freedom to make life choices Generosity Perceptions of corruption Positive affect Negative affect
0 26 Albania 2020 5.364910 9.492215 0.710115 69.125000 0.753671 0.004123 0.891359 0.563486 0.265066
1 37 Algeria 2020 5.437755 9.291438 0.867649 66.500000 0.573891 -0.121148 0.724264 0.524169 0.310630
2 57 Argentina 2020 5.900567 9.887623 0.897104 67.150002 0.823392 -0.128546 0.815780 0.679317 0.342497
3 89 Australia 2020 7.137368 10.794416 0.936517 70.974998 0.905283 0.203258 0.491095 0.725689 0.205078
4 104 Austria 2020 7.213489 10.858776 0.924831 71.025002 0.911910 0.006125 0.463830 0.716352 0.206500
5 131 Bahrain 2020 6.173176 10.775782 0.847745 65.824997 0.945233 0.117025 NaN 0.729510 0.296835
6 146 Bangladesh 2020 5.279987 8.628981 0.739338 64.425003 0.777467 -0.022516 0.741659 0.484560 0.331709
7 176 Belgium 2020 6.838761 10.799146 0.903559 70.750000 0.766918 -0.170136 0.633627 0.619125 0.260189
8 192 Benin 2020 4.407746 8.066671 0.506636 55.799999 0.783115 -0.082945 0.531884 0.556502 0.304512
9 212 Bolivia 2020 5.559259 8.946366 0.804811 63.450001 0.877032 -0.052822 0.868208 0.728823 0.381791
10 227 Bosnia and Herzegovina 2020 5.515816 9.572131 0.898519 67.250000 0.740251 0.135483 0.916052 0.601871 0.325412
11 256 Brazil 2020 6.109718 9.548380 0.830832 65.574997 0.786235 -0.057945 0.728772 0.653300 0.389139
12 270 Bulgaria 2020 5.597723 10.020541 0.916242 66.400002 0.818225 -0.010107 0.900633 0.642388 0.221351
13 286 Burkina Faso 2020 4.639640 7.646704 0.667709 55.275002 0.750226 0.123013 0.808745 0.604512 0.388478
14 307 Cambodia 2020 4.376985 8.360816 0.724423 61.700001 0.963075 0.050262 0.863054 0.770770 0.389852
15 324 Cameroon 2020 5.241078 8.206723 0.720047 55.174999 0.674509 0.044859 0.836517 0.626285 0.386479
16 341 Canada 2020 7.024905 10.737793 0.930611 71.349998 0.886892 0.044788 0.434012 0.738485 0.306674
17 378 Chile 2020 6.150643 10.044019 0.888412 70.150002 0.781384 0.027615 0.811819 0.752885 0.336029
18 395 China 2020 5.771065 9.698723 0.808334 68.625000 0.891123 -0.106377 NaN 0.662961 0.244918
19 411 Colombia 2020 5.709175 9.502091 0.797035 69.175003 0.840186 -0.088326 0.807964 0.758543 0.340159

The SQL query above will retrieve 2020 information from the "happy" table. The outputted table will be ordered in descending order based on a country's happiness score (Life Ladder).

This query is useful as it helps identify which countries hold the highest happiness score, specifically during 2020. This information will be pivotal later on, as we will use it to compare it with education inequality score to see if there are any trends.

In [10]:
#-- example windowing function
query_5 = pd.read_sql_query("""
    SELECT 
        RANK() OVER (PARTITION BY year ORDER BY `Life Ladder` DESC) AS country_rank,
        `Country name`,
        `Life Ladder`,
        year
    FROM happy;

""", engine)

query_5.head(10)
Out[10]:
country_rank Country name Life Ladder year
0 1 Denmark 8.018934 2005
1 2 Netherlands 7.463979 2005
2 3 Canada 7.418048 2005
3 4 Sweden 7.376316 2005
4 5 Australia 7.340688 2005
5 6 Belgium 7.262290 2005
6 7 Venezuela 7.169621 2005
7 8 Spain 7.152786 2005
8 9 France 7.093393 2005
9 10 Saudi Arabia 7.079644 2005

Th query above uses information from the "happy" table. With the RANK() window function, we can assign each country a rank based on their happiness score for a certain year. Rankings will begin at 1, indicating the most happiest country, and will progress down accordingly. This function in conjunction with the PARTITION BY year clause will rank countries for each year separately. The output from this query will allow for comparative analysis where we can easily find the most happiest countries across multiple years.

In [11]:
query_11 = pd.read_sql_query("""
    SELECT happy.`Country name`, happy.year, happy.`Life Ladder`, inequality.`Inequality in Education (2020)`
    FROM happy
    INNER JOIN inequality
    ON happy.`Country name` LIKE inequality.Country || '%'
    WHERE happy.year = 2020
    ORDER BY inequality.`Inequality in Education (2020)` DESC;
""", engine)

We will now create a query that will investigate and attempt to answer our first guiding question, is there a correlation between a Country's happiness score and its education inequality score? To accomplish this task, the Happiness table and education inequality table will be joined using the inner join function. The join condition is based on a partial match, where the Country name from the happy table should start with the Country string from the inequality table (Inequality in Education around the World, n.d.). The query will retrieve information regarding country name, year, Life Ladder score, and Education Inequality score for 2020. Data will be filtered to present only records from 2020 and the results will be sorted based on life ladder score.

4.1.3 Data Visualization & Analysis:¶

First, we will take a glimpse of the data to understand what information we are working with. Through the visualization, we can more adequately analyze the happiest countries based on their "Life Ladder" score across multiple years.

In [12]:
import pandas as pd
import plotly.express as px

# Create an animated horizontal bar plot
fig = px.bar(top_countries_data, y="Country name", x="Life Ladder", color="Country name",
             animation_frame="year", 
             title="Top Ten Countries by Life Ladder Score Each Year (Including Canada)",
             labels={"Country name": "Country", "Life Ladder": "Life Ladder Score"},
             category_orders={"Country name": sorted(top_countries_data['Country name'].unique())},
             orientation='h',
             range_x=[6, top_countries_data['Life Ladder'].max() + 0.5])

# Adjusting the size of the figure (half of the default size)
fig.update_layout(width=1000, height=600)

# Slow down the animation speed (half the default speed)
fig.show(animation_options={"frame": {"duration": 10000}, "transition": {"duration": 500}})

# Export the plot as an HTML file
fig.write_html("animated_bar_plot.html")

The animated bar plot provides a dynamic and interactive way to observe changes in rankings over the years. This visualization will first give us a brief overview of how a country's happiness rank changed year-over-year. This visual also illustrates which countries are constantly performing well, noting them as key influencers whom others can learn from. Interestingly, we can also examine how Canada fares against other top ranking countries.

In [13]:
import plotly.graph_objects as go

testhappy = pd.read_sql_table('happy', engine)

latest_year = testhappy['year'].max()
top_10_countries_latest = testhappy[testhappy['year'] == latest_year].nlargest(10, 'Life Ladder')['Country name']
historical_data_top_10 = testhappy[testhappy['Country name'].isin(top_10_countries_latest)].copy()
historical_data_top_10['Rank'] = historical_data_top_10.groupby('year')['Life Ladder'].rank("dense", ascending=False)
pivot_data_top_10 = historical_data_top_10.pivot(index='Country name', columns='year', values='Rank')

fig_top_10 = go.Figure()

for country in pivot_data_top_10.index:
    fig_top_10.add_trace(go.Scatter(
        x=pivot_data_top_10.columns,
        y=pivot_data_top_10.loc[country],
        mode='lines+markers',
        name=country,
        line=dict(width=3)  # Set the line width to 3
    ))

fig_top_10.update_layout(
    title="Ranking of Top 10 Countries in Life Ladder Over Time",
    xaxis_title="Year",
    yaxis_title="Rank",
    height=400, width=1000,
    yaxis=dict(autorange="reversed"),
    showlegend=True
)

fig_top_10.show()
fig.write_html("bumpchart.html")

We will use a bump chart, which is available in the plotly library, to identify the top 10 countries that have ranked the highest from 2004 to 2022. By analyzing the chart, we can observe that Denmark, Finland, and Norway have consistently outperformed other countries in terms of meeting the needs of their citizens over the past three years.

In [14]:
import pandas as pd
import plotly.express as px

# Filtering the dataset for the year 2020
data_2020 = query_11[query_11['year'] == 2020]

# Creating the map visualization
fig = px.choropleth(data_2020, 
                    locations="Country name",
                    locationmode="country names",
                    color="Life Ladder",
                    hover_data=["Life Ladder", "Inequality in Education (2020)"],  # Add "Life Ladder" and "Inequality in Education (2020)" to hover data
                    color_continuous_scale="Viridis",
                    title="Life Ladder Scores and Inequality in Education (2020) Around the World in 2020")

# Set the height and width of the figure
fig.update_layout(height=500, width=1000, margin=dict(l=0, r=0, t=0, b=0))

# Show the figure (in an interactive Python environment)
fig.show()
fig.write_html("happymap.html")

Through Plotly's map function, we are better able to determine which countries are the happiest. From the map, we can clearly see that the Northern part of the globe holds a higher average happiness score. This is, however, contrasted by the Southern part of the globe, which shows that those countries typically hold a lower average happiness score during 2020.

In [15]:
#select few columns from each table then merge two tables into one by column name where happy.`Country name` LIKE inequality.Country
query_10 = pd.read_sql_query("""
    SELECT happy.`Country name`, happy.year, happy.`Life Ladder`, inequality.`Inequality in Education (2020)`
    FROM happy
    INNER JOIN inequality
    ON happy.`Country name` LIKE inequality.Country || '%'
    WHERE happy.year = 2020
    ORDER BY inequality.`Inequality in Education (2020)` DESC;
""", engine)

query_10.head(20)
Out[15]:
Country name year Life Ladder Inequality in Education (2020)
0 Guinea 2020 4.972168 50.124110
1 Senegal 2020 4.756773 47.135300
2 Ivory Coast 2020 5.256504 45.609040
3 Mali 2020 4.269474 43.859650
4 Benin 2020 4.407746 43.693550
5 Pakistan 2020 4.623969 43.527950
6 Ethiopia 2020 4.549220 42.770360
7 Morocco 2020 4.802618 41.870930
8 Nepal 2020 5.982410 41.099200
9 Nigeria 2020 5.502948 40.415070
10 Burkina Faso 2020 4.639640 39.152794
11 Bangladesh 2020 5.279987 37.311770
12 Egypt 2020 4.472397 36.942910
13 India 2020 4.223866 36.875730
14 Ghana 2020 5.319483 35.101670
15 Algeria 2020 5.437755 33.282620
16 Cameroon 2020 5.241078 31.716990
17 Tunisia 2020 4.730811 30.702850
18 Iraq 2020 4.785165 29.745910
19 Uganda 2020 4.640910 27.914380

This query highlights that countries with the lowest happiness scores are typically located farther south than Canada, often in African and Middle Eastern regions. From the table, the top 5 countries with the greatest education inequality score all reside from Africa.

In [16]:
import numpy as np
import matplotlib.pyplot as plt

# Create a smooth line
x = np.linspace(min(query_11['Inequality in Education (2020)']), max(query_11['Inequality in Education (2020)']), 100)
y = np.poly1d(np.polyfit(query_11['Inequality in Education (2020)'], query_11['Life Ladder'], 3))(x)

# Plot the scatter plot with smooth line
plt.scatter(query_11['Inequality in Education (2020)'], query_11['Life Ladder'], c=query_11['Life Ladder'], cmap='Oranges')
plt.plot(x, y, color='black', linewidth=2, linestyle='--')  # Use dashed line

# Set labels and title
plt.xlabel('Inequality in Education (2020)')
plt.ylabel('Life Ladder')
plt.title('Correlation between Inequality in Education (2020) and Life Ladder')

# Add colorbar
plt.colorbar(label='Life Ladder')

# Show the plot
plt.show()

By using a scatter plot created with Plotly, we can observe the relationship between the Happiness score and the Inequality in Education score. The scatter plot indicates a clear negative correlation between these two variables. As the Inequality in Education score goes up, the Life Ladder (happiness) score goes down. Therefore, we can conclude that there is an inverse relationship between these two factors.

4.1.4 Conclusion for Question 1:¶

Education Inequality score and Happiness score appear to have a negative correlation. This is depicted as Education Inequality increases, and the Life ladder (happiness) score decreases. It is also evident that there is a significant geographical difference in terms of happiness and educational inequality across the world. Countries located in the northern hemisphere tend to have higher average happiness scores and lower levels of educational inequality. On the other hand, countries in the southern hemisphere seem to have comparatively lower happiness scores and higher education inequality scores for the year 2020. This geographical contrast highlights the differences in societal well-being and educational access, underlining the need for a deeper understanding of the root causes driving these disparities.

4.2 Guiding Question 2¶

4.2.1 Question Overview:¶

Are there any relationships between population-related factors and the happiness of a country?¶

Happiness, which is closely related to life satisfaction, is influenced by various factors that contribute to one's overall quality of life. Some studies have indicated potential correlations between population-related factors and quality of life.

  • For instance, higher population density is often associated with lower levels of happiness due to challenges such as overcrowding and traffic congestion (Belikow et al., 2021; Li & Kanazawa, 2016).
  • Additionally, the percentage of the urban population plays a role in a country's happiness level, as urbanization tends to have a positive impact on individual happiness (Lenzi & Perucca, 2022).
  • Furthermore, fertility rates have been found to have a negative association with happiness, suggesting that having fewer children is linked to higher levels of happiness (Margolis & Myrskylä, 2011).

The provided evidence indicates that there may be links between happiness and population-related factors. We aim to investigate these connections at a country level by utilizing the chosen datasets. We will examine whether the factors mentioned earlier can be considered dependable indicators of a country's overall happiness level.

4.2.2 Data Wrangling Process:¶

In [17]:
query_datatype = """
SELECT COLUMN_NAME, DATA_TYPE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'population_db' OR TABLE_NAME = 'happiness_db';
"""

pd.read_sql_query(query_datatype, engine)
Out[17]:
COLUMN_NAME DATA_TYPE
0 index bigint
1 Country name text
2 Regional indicator text
3 Ladder score double
4 Standard error of ladder score double
5 upperwhisker double
6 lowerwhisker double
7 Logged GDP per capita double
8 Social support double
9 Healthy life expectancy double
10 Freedom to make life choices double
11 Generosity double
12 Perceptions of corruption double
13 Ladder score in Dystopia double
14 Explained by: Log GDP per capita double
15 Explained by: Social support double
16 Explained by: Healthy life expectancy double
17 Explained by: Freedom to make life choices double
18 Explained by: Generosity double
19 Explained by: Perceptions of corruption double
20 Dystopia + residual double
21 index bigint
22 Country text
23 Population bigint
24 Growth double
25 Net Change bigint
26 Density bigint
27 Area bigint
28 Migrants double
29 Fertility Rate double
30 Median Age double
31 Urban Population double
32 World Share double

The above query is performed to display the datatype of each column in both the happiness and population datasets. From the above table, there is a part that requires data wrangling to change the data type. It is noticed that the Growth, urban population and world share columns are in datatype of text because the values in these columns contain a percentage sign. For the sake of using the data from these columns for correlation analysis, the datatype has to be changed to double, which is a kind of float.

In [18]:
value_setting_statement = sq.text("""
UPDATE population_db
SET 
    Growth = REPLACE(Growth, '%', ''), 
    `Urban Population` = REPLACE(`Urban Population`, '%', ''), 
    `World Share` = REPLACE(`World Share`, '%', '');
""")

with engine.connect() as connection:
    connection.execute(value_setting_statement)

pd.read_sql_table("population_db", engine)
Out[18]:
index Country Population Growth Net Change Density Area Migrants Fertility Rate Median Age Urban Population World Share
0 0 China 1440297825 0.39 5540090 153 9388211 -348399.0 1.7 38.0 61.0 18.47
1 1 India 1382345085 0.99 13586631 464 2973190 -532687.0 2.2 28.0 35.0 17.70
2 2 United States 331341050 0.59 1937734 36 9147420 954806.0 1.8 38.0 83.0 4.25
3 3 Indonesia 274021604 1.07 2898047 151 1811570 -98955.0 2.3 30.0 56.0 3.51
4 4 Pakistan 221612785 2.00 4327022 287 770880 -233379.0 3.6 23.0 35.0 2.83
... ... ... ... ... ... ... ... ... ... ... ... ...
230 230 Montserrat 4993 0.06 3 50 100 NaN NaN NaN 10.0 0.00
231 231 Falkland Islands 3497 3.05 103 0 12170 NaN NaN NaN 66.0 0.00
232 232 Niue 1628 0.68 11 6 260 NaN NaN NaN 46.0 0.00
233 233 Tokelau 1360 1.27 17 136 10 NaN NaN NaN 0.0 0.00
234 234 Holy See 801 0.25 2 2003 0 NaN NaN NaN NaN 0.00

235 rows × 12 columns

This query removes percentage signs from the 'Growth,' 'Urban Population,' and 'World Share' columns. The objective is to convert these columns from string representations of percentages to numeric values, facilitating analysis.

In [19]:
datatype_modifying_statement = sq.text("""
ALTER TABLE population_db
MODIFY COLUMN Growth DOUBLE,
MODIFY COLUMN `Urban Population` DOUBLE,
MODIFY COLUMN `World Share` DOUBLE;
""")

with engine.connect() as connection:
    connection.execute(datatype_modifying_statement)

This query changes the data type of the 'Growth,' 'Urban Population,' and 'World Share' columns from their previous data type to double.

In [20]:
query_datatype = """
SELECT COLUMN_NAME, DATA_TYPE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'population_db' OR TABLE_NAME = 'happiness_db';
"""

pd.read_sql_query(query_datatype, engine)
Out[20]:
COLUMN_NAME DATA_TYPE
0 index bigint
1 Country name text
2 Regional indicator text
3 Ladder score double
4 Standard error of ladder score double
5 upperwhisker double
6 lowerwhisker double
7 Logged GDP per capita double
8 Social support double
9 Healthy life expectancy double
10 Freedom to make life choices double
11 Generosity double
12 Perceptions of corruption double
13 Ladder score in Dystopia double
14 Explained by: Log GDP per capita double
15 Explained by: Social support double
16 Explained by: Healthy life expectancy double
17 Explained by: Freedom to make life choices double
18 Explained by: Generosity double
19 Explained by: Perceptions of corruption double
20 Dystopia + residual double
21 index bigint
22 Country text
23 Population bigint
24 Growth double
25 Net Change bigint
26 Density bigint
27 Area bigint
28 Migrants double
29 Fertility Rate double
30 Median Age double
31 Urban Population double
32 World Share double

Now, the datatype of all columns looks to be correct and the data is ready for use.

In [21]:
population_db = pd.read_sql_table("population_db", engine)
population_db.isna().sum() 
Out[21]:
index                0
Country              0
Population           0
Growth               0
Net Change           0
Density              0
Area                 0
Migrants            34
Fertility Rate      34
Median Age          34
Urban Population    13
World Share          0
dtype: int64

In the selected dataset, certain columns such as "Migrants", "Fertility Rate", "Median Age", and "Urban Population" contain null values. It is important to note that the "Migrants" column is not utilized in this project.

In [22]:
query_join1 = """
SELECT `Country name`, `Regional indicator`, Population, Growth, Density, `Fertility Rate`, `Median Age`, `Urban Population`, `Ladder score`
FROM happiness_db INNER JOIN population_db
ON happiness_db.`Country name` LIKE CONCAT(population_db.Country, '%')
GROUP BY `Country name`
ORDER BY `Ladder score` DESC;
"""

joinedtable = pd.read_sql_query(query_join1, engine)
joinedtable
Out[22]:
Country name Regional indicator Population Growth Density Fertility Rate Median Age Urban Population Ladder score
0 Finland Western Europe 5542237 0.15 18 1.5 43.0 86.0 7.8087
1 Denmark Western Europe 5795780 0.35 137 1.8 42.0 88.0 7.6456
2 Switzerland Western Europe 8665615 0.74 219 1.5 43.0 74.0 7.5599
3 Iceland Western Europe 341628 0.65 3 1.8 37.0 94.0 7.5045
4 Norway Western Europe 5428594 0.79 15 1.7 40.0 83.0 7.4880
... ... ... ... ... ... ... ... ... ...
141 Central African Republic Sub-Saharan Africa 4843954 1.78 8 4.8 18.0 43.0 3.4759
142 Rwanda Sub-Saharan Africa 13005303 2.58 525 4.1 20.0 18.0 3.3123
143 Zimbabwe Sub-Saharan Africa 14899771 1.48 38 3.6 19.0 38.0 3.2992
144 South Sudan Sub-Saharan Africa 11216250 1.19 18 4.7 19.0 25.0 2.8166
145 Afghanistan South Asia 39074280 2.33 60 4.6 18.0 25.0 2.5669

146 rows × 9 columns

The population database and the happiness database are merged using the country name as the key column. An inner join is performed to ensure that only countries with data in both databases are included. Since the country names in the two databases may have variations, such as short forms or special characters, the LIKE operator is used to match similar country names. The resulting joined table is used for analysis and exploration in the following.

In [23]:
joinedtable.isna().sum()
Out[23]:
Country name          0
Regional indicator    0
Population            0
Growth                0
Density               0
Fertility Rate        0
Median Age            0
Urban Population      4
Ladder score          0
dtype: int64

Upon merging the happiness dataset to create the joined table, it is observed that only one column ("Urban Population") contains null values. Instead of removing these rows due to the null values, a decision is made to keep them. The rationale behind this decision is that the number of rows affected by null values is minimal (~2%), specifically limited to four values.

In addition to the limited impact, the choice to keep these rows is reinforced by the recognition that the remaining columns in these specific rows still contain valuable and meaningful information. Because of this, it is deemed more beneficial to keep all the rows in the joinedtable, including the rows with null values, rather than removing them.

4.2.3 Data Visualization & Analysis:¶

In [24]:
import plotly.express as px

# Data preparation
correlation_data = joinedtable[['Population', 'Growth', 'Density', 'Fertility Rate', 'Median Age', 'Urban Population', 'Ladder score']]
correlation_matrix = correlation_data.corr() # Calculate the correlation matrix

# Creating the heatmap with values displayed
fig = px.imshow(correlation_matrix, 
                x = correlation_matrix.columns, 
                y = correlation_matrix.columns,
                text_auto = '.2f')  # Automatically add text in each cell
fig.update_layout(title = "Correlation Heatmap of Happiness Ladder Score and Population Related Parameters",
                  height = 600, 
                  width = 1000, 
                  paper_bgcolor='black',  # Background color
                  font_color='white',   # General text color
                  title_font_color='white') # Title text color


fig.show()

This heatmap illustrates the relationships among various variables, such as population, population growth, density, fertility rate, median age, percentage of urban population, and happiness ladder score. Notably, the ladder score exhibits robust associations with median age, urban population, and fertility rate, as evidenced by high correlation coefficients exceeding 0.6. Additionally, a significant link exists between ladder score and population growth. However, given the high inter-correlations between growth, median age, and fertility rate, and the comparatively weaker correlation with ladder score, we decide to exclude the growth parameter to mitigate the impact of multicollinearity among independent variables.

While considerable intercorrelations persist between median age, fertility rate, and urban population percentage, these variables also share similarly high correlations with the happiness score. Consequently, a more in-depth analysis of these variables is needed to explore the most reliable indicator of happiness level.

In [25]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Ladder Score vs Fertility Rate
fig1 = px.scatter(joinedtable, 
                 x = 'Fertility Rate',
                 y = 'Ladder score', 
                 trendline = 'ols', # Add a trendline
                 color_discrete_sequence = ['skyblue'], # Set color of dots
                 trendline_color_override = 'lightblue') # Set color of trendline

# Ladder Score vs Median Age
fig2 = px.scatter(joinedtable, 
                  x = 'Median Age', 
                  y = 'Ladder score',
                  trendline = 'ols', # Add a trendline
                  color_discrete_sequence=['lightcoral'], # Set color of dots
                  trendline_color_override = 'lightpink') # Set color of trendline

# Ladder Score vs Urban Population
fig3 = px.scatter(joinedtable, 
                  x = 'Urban Population', 
                  y = 'Ladder score',
                  trendline = 'ols', # Add a trendline
                  color_discrete_sequence=['lightseagreen'], # Set color of dots
                  trendline_color_override = 'lightgreen') # Set color of trendline

# Create subplots
fig = make_subplots(rows = 1, cols = 3, subplot_titles=("Ladder Score vs Fertility Rate",
                                                        "Ladder Score vs Median Age", 
                                                        "Ladder Score vs Urban Population")) 

# Add the scatter plots to the subplots
for data in fig1.data:
    fig.add_trace(data, row = 1, col = 1)

for data in fig2.data:
    fig.add_trace(data, row = 1, col = 2)

for data in fig3.data:
    fig.add_trace(data, row = 1, col = 3)

# Update x-axis and y-axis titles for each subplot
fig.update_yaxes(row = 1, col = 1, title_text = "Ladder Score")

fig.update_xaxes(row = 1, col = 1, title_text = "Fertility Rate (%)")
fig.update_xaxes(row = 1, col = 2, title_text = "Median Age")
fig.update_xaxes(row = 1, col = 3, title_text = "Urban Population (%)")

# Update layout
fig.update_layout(height = 400, 
                  width = 1000, 
                  paper_bgcolor = 'black', # Background color
                  font_color= 'white', # General font color
                  plot_bgcolor = 'black', # Plot background color
                  title_text = "Scatterplots of Happiness Score vs Potential Influencing Factors")
fig.update_traces(marker = dict(size = 8, line = dict(width = 1.5, color = 'black')),
                  selector =dict(mode = 'markers')) # Set marker size and color
fig.show()

We generate scatterplots to explore the relationships between the happiness score and the three identified factors. Our observations reveal noticeable correlations in all three plots, where the ladder score demonstrates a negative association with fertility rate and a positive association with both median age and the percentage of urban population. Upon closer examination of individual plots, a concentration of data points is noticed at low fertility rates, forming a curve-like distribution. Consequently, when compared with the other two variables, fertility rate appears less optimal as an indicator.

In contrast, both median age and urban population percentage exhibit similar distributions, with data points evenly dispersed along the fitted line. Consequently, both factors appear suitable as indicators of happiness level. However, considering the high intercorrelation between these two factors, we should choose either one. Based on correlation coefficients, median age slightly outperforms urban population percentage, being the most favorable indicator of happiness level among all population-related parameters. The strong correlation between happiness and median age may be attributed to the idea that happier individuals tend to live longer, and hence result in higher median age of people in happier countries.

4.2.4 Conclusion for Question 2:¶

In conclusion, this analysis provides valuable insights into the dataset and highlights the relationships between happiness level and different population-related variables. We found negative correlations with fertility rate and positive correlations with median age and urban population. While both median age and urban population are suitable happiness indicators, due to their high intercorrelation, median age is considered as a more reliable happiness level indicator among population-related factors. The findings can be used to further explore and understand the factors influencing happiness levels in the population.

4.3 Guiding Question 3¶

4.3.1 Question Overview:¶

Is there a significant relationship between population and the amount of education inequality a country has?¶

This inquiry aims to determine whether countries with larger populations exhibit more academic inequality due to their size or if educational inequality arises from other factors, such as a country's improper allocation of resources to meet academic demand. Studies have shown that an individual's social class significantly impacts their academic success, particularly during their formative years (García & Weiss, 2017). Unfortunately, it is often the case that those who start with less tend to stay behind, while those who start with more tend to keep advancing. Despite the inevitability of inequality within a country, some have made progress in narrowing the education gap among high and low-income families. One such example is Finland, which overhauled its education system, offering students the same education stream until the age of 16, regardless of their income, disability, or gender (Saunders, 2016).

4.3.2 Data Wrangling Process:¶

In [26]:
query = """
    SELECT h.`Country name`,
           e.`Human Development Groups`,
           e.`Inequality in Education (2020)`,
           p.`Density (P/Km²)`
    FROM happiness h
    INNER JOIN education e ON h.`Country name` = e.`Country`
    INNER JOIN population2020 p ON h.`Country name` = p.`Country (or dependency)`
"""

table_q1 = pd.read_sql_query(query, engine)

table_q1
Out[26]:
Country name Human Development Groups Inequality in Education (2020) Density (P/Km²)
0 China High 11.680000 153
1 India Medium 36.875730 464
2 United States Very High 2.745850 36
3 Indonesia High 17.250017 151
4 Pakistan Low 43.527950 287
... ... ... ... ...
134 Montenegro Very High 7.815010 47
135 Luxembourg Very High 4.717070 242
136 Maldives High 29.319480 1802
137 Malta Very High 5.192380 1380
138 Iceland Very High 2.200770 3

139 rows × 4 columns

This query divides the data between different Human Development Groups. This was done to fairly compare countries undergoing similar Human Development Index attributes (i.e., life expectancy, education, and per capita income). Population density was used in favor to population as it provides a more general sense of population in a given area. The following table will be used to determine if a more populated country in regards to density affects education inequality.

In [27]:
# check for unnecssary null values

table_q1[table_q1.isna().any(axis=1)]
Out[27]:
Country name Human Development Groups Inequality in Education (2020) Density (P/Km²)
92 Libya High NaN 4
129 Trinidad and Tobago Very High NaN 273

The code above will check for any unnecssary null values.

In [28]:
# remove null rows

final1 = table_q1.dropna(how='any',axis=0) 
final1 = final1.reset_index()
final1.head(10)
Out[28]:
index Country name Human Development Groups Inequality in Education (2020) Density (P/Km²)
0 0 China High 11.680000 153
1 1 India Medium 36.875730 464
2 2 United States Very High 2.745850 36
3 3 Indonesia High 17.250017 151
4 4 Pakistan Low 43.527950 287
5 5 Brazil High 15.652570 25
6 6 Nigeria Low 40.415070 226
7 7 Bangladesh Medium 37.311770 1265
8 8 Mexico High 13.487955 66
9 9 Japan Very High 4.511000 347

Since there were instances of null values in the table, we will use the code above to remove any null value data instances, while reseting the index.

In [29]:
query = """
SELECT
    h.`Regional indicator`,
    AVG(e.`Inequality in Education (2020)`) AS avg_inequality_2020
FROM
    happiness h
INNER JOIN
    education e ON h.`Country name` = e.`Country`
GROUP BY
    h.`Regional indicator`;

"""

final2 = pd.read_sql_query(query, engine)

# Produce final table to use in visualization
final2
Out[29]:
Regional indicator avg_inequality_2020
0 Central and Eastern Europe 5.478769
1 Commonwealth of Independent States 3.117286
2 East Asia 9.227668
3 Latin America and Caribbean 16.096977
4 Middle East and North Africa 22.042057
5 North America and ANZ 2.540888
6 South Asia 35.070890
7 Southeast Asia 17.202814
8 Sub-Saharan Africa 34.618264
9 Western Europe 5.761916

This query seeks to identify the regions with the highest education inequality. Furthermore, the query's output serves as a statistical benchmark to help countries determine how they compare to their region's average education inequality score. The hope is that countries with higher education inequality scores located in more equitable regions will be encouraged to assess and implement more equitable infrastructure to reduce their education inequality score.

The table above seems to not have any null values, therefore no additional measures need to be done to remove null data instances.

4.3.3 Data Visualization & Analysis:¶

In [30]:
import plotly.express as px

# Create Scatter plot:
fig = px.scatter(final1, y="Inequality in Education (2020)", x="Density (P/Km²)",
                 color="Human Development Groups", trendline="ols", labels={
                     "Inequality in Education (2020)": "Education Inequality (2020)",
                     "Density (P/Km²)": "Population Density (P/Km²)"
                 },
                title="Population Density vs Education Inequality for 2020")

# update plot size:
fig.update_layout(
    autosize=False,
    width=1200,
    height=500,)

fig.update_xaxes(range = [-75,2000])

# show figure:
fig.show()

From the scatterplot above, the trend lines for medium, high, and very high human development groups indicate that as the density population increases, so does education inequality. This notion is consistent with a Harvard study that examines the relationship of population density to educational inequality in the United States. In this report, researchers found that communities with denser population levels had a higher socioeconomic gap, which contributed to the increase in inequality in the area (Carlos, n.d.). Although most of the data supports the previous study, low human development countries show a negative relationship between population density and education inequality. This means that as the population density increases, education inequality decreases. One of the possible reasons behind this result is that urbanization has positive effects on rural-like environments. As the population density increases, there is more access to resources and institutions including healthcare, transportation, and education. This suggests that more access to educational resources in rural areas can reduce education inequality (Bloom & Khanna, 2007).

In [31]:
import plotly.express as px

# Create bar chart:
fig1 = px.bar(final2, x="Regional indicator", y="avg_inequality_2020", labels={
                     "avg_inequality_2020": "Average Education Inequality Score (2020)",
                     "Regional indicator": "Region"
                 },
              title="Average Education Inequality Score by Regional in 2020", color = "Regional indicator")

# update plot size:
fig1.update_layout(
    autosize=False,
    width=1000,
    height=500,)

# remove legend as it was not necessary since it was the same as x-axis labels
fig1.update_layout(showlegend=False)

# show figure:
fig1.show()    

Inequality is present in every society, and it can manifest in various forms such as gender, age, ethnicity, origin, disability, or social class disparities (United Nations, n.d.). When it comes to education inequality, based on observations of several countries' education systems, three main factors seem to affect the dependent variable.

  1. Social Class: Countries such as the USA demonstrate greater education inequality due to discrepancies in social class, in which poor people typically attend lower-quality schools. This deprives citizens of equal opportunities and hinders students from meeting their fullest potential (Wiher, 2020).
  2. Origin: Similar to social class, foreign students who are struggling academically typically remain in the same situation. This is due to a country's late enrollment and early selection system in which students of different academic levels are divided at the start of the year. While high academic achievers remain with peers similar, the notion is the same for those who are not as well off. This makes it extremely difficult for children from educationally disadvantaged families to have enough time to compensate for their disadvantages by interacting with children who reside in better-off families in terms of education and social class (Wiher, 2020).
  3. Gender: In some areas, girls are not allowed to go to school because of social beliefs. This notion deprives girls of their right to education (Wiher, 2020). Pakistan is one example that discriminates against females for their rights to receive an education due to lack of proper resources, age, and fear of violence.

The bar graph above displays the differences in education inequality among various regions in the world during 2020. Some regions have relatively lower average education inequality scores than others. For instance, the North America and ANZ region scored the lowest with an average education inequality score of 2.54, whereas regions such as South Asia and Sub-Saharan Africa recorded higher-than-average scores of over 30. This variation in education inequality could be attributed to the different types of inequality present and the extent to which they exist within each region. Mismanagement of the above factors could lead to a rise in education inequality, which can negatively impact a country's education equality system.

Other reasons behind a specific region's education inequality score:

  • Although countries like Switzerland and Finland are considered to be happier and have a lower education inequality score compared to the USA, the overall average of Western Europe, to which they belong, is still higher than its counterpart. This is because some other countries in the region, like Spain, exhibit greater inequality due to the effects of socioeconomic origin and status. Low socioeconomic status often leads to poor academic performance within the country (Mendizabal, 2018).

  • Over the years, many countries have made significant improvements to reduce education inequality. Although some countries still discriminate against women's education rights, most countries have made efforts to expand their education systems. For instance, in the mid-2000s, women in China gained more opportunities to receive education. As a result, gender equality in the region improved over time, and the gap in college enrollment was even reversed to favor women in 2005 (Wu & Zhang, 2010).

  • In terms of education inequality, East Asia has a slightly higher score compared to Western countries. In addition to social class, this fact can be attributed to the competitive and rigorous academic structure in Eastern countries, where good grades are essential for academic and professional advancement. In China, for instance, students have to pass entrance exams to gain admission to prestigious high schools and universities in the future. This competitive environment impacts students' mental health, which leads some students to cheat on their exams for a better life (Fang, 2019).

4.3.4 Conclusion for Question 3:¶

In conclusion, the analysis of education inequality across different human development groups and regions in 2020 has highlighted significant disparities and multifaceted causes. While increasing population density corresponds to higher education inequality in medium, high, and very-high-development countries, low-human-development countries present a counterintuitive negative relationship, suggesting that increased population density can reduce education inequality by enhancing access to resources in rural environments. Thus, it is crucial to consider regional dynamics and urbanization effects in addressing global education disparities.

In addition to other factors, social class, origin-based discrimination, and gender-related discrimination all play a significant role in creating education inequality. North America, Australia, and New Zealand have been successful in managing these factors, resulting in a fair distribution of education compared to other parts of the world.


5. Discussion and Reflection¶

5.1 What has each group member learned from this project?¶

  1. Team Member 1 - Bo Li:

    • From this project, I have gained valuable insights into the significance of data analysis and visualization, understanding the pivotal role they play in making informed decisions. This experience has equipped me with skills in cleaning and manipulating data to ensure its readiness for analysis, while also honing my proficiency in using SQL queries to connect to databases, retrieve, and seamlessly join data from multiple tables. Ultimately, the project has not only deepened my understanding of drawing meaningful conclusions from data analysis and visualization but has also provided practical, hands-on expertise, in which I can apply diverse techniques and tools to real-world datasets.
  1. Team Member 2 - Brian Ho:

    • Through this project, I have picked up some important lessons. I have learned how looking at data and creating visual charts can help me understand things better and make smarter decisions. The hands-on experience with cleaning and organizing data has equipped me with practical skills for data preparation. Using SQL query to extract information from databases and bring them together from different tables have provided me a practical understanding of database interactions and how to work with MYSQL. Moreover, this project showed me how to draw useful conclusions from my analysis and visualizations. Overall, it was a great opportunity to practice using different techniques and tools with real-world data.
  1. Team Member 3 - Ethan Mah:

    • During my time in DATA 604 and after completing the final project, I can confidently say that I have gained valuable knowledge and skills related to data management, even though it was only for a short period. What I will take away from this experience are the technical skills that I developed, particularly in SQL coding. Additionally, I have learned how to analyze datasets effectively to extract meaningful information from them.Throughout the project proposal, individual milestone, and final project, I was consistently challenged to think about how my team's selected datasets could relate to each other and how I could build queries that reflected those ideas. Through this process, I have developed a solid understanding of the SQL syntax that allowed me to extract information from joined datasets to answer relevant questions. Although I have only received a brief introduction to SQL, which was sufficient to complete the project, I am eager to learn more about this language and how it can be used to analyze and interpret data.

5.2 What would you do differently now?¶

  1. Team Member 1 - Bo Li:

    • To enhance the overall quality and depth of the project, I propose implementing several improvements. These include documenting the entire workflow, encompassing data cleaning steps, analysis techniques, and visualization methods, to improve reproducibility and efficiency. Furthermore, automating repetitive tasks through scripts or functions would save time and enhance the scalability of the project. Through these changes, we could have enhanced our analysis in terms of quality and depth, fostering a more profound understanding of the relationships between population, inequality, and happiness.
  1. Team Member 2 - Brian Ho:

    • If we were to run the project again or make improvements, I would suggest refining our approach to dataset selection and integration. In this group project, we initially chose three datasets (happiness, population, and education inequality datasets) to conduct correlation analysis on happiness levels and various factors. However, reflecting on the project, I realize that more careful consideration in dataset selection could lead to a more comprehensive analysis. Instead of just looking at happiness, population, and education inequality, we could consider a wider range of datasets that cover different dimensions of social well-being. This might include factors such as economic indicators, healthcare, or cultural aspects. By looking at more varied data, we could get a fuller picture of what makes people happy. Additionally, more collaborative data wrangling and cleaning efforts among team members could ensure a more unified and streamlined approach. This could result in a more cohesive and efficient analysis.
  1. Team Member 3 - Ethan Mah:

    • Initially, due to project deadlines and the limited availability of suitable data capturing various countries in 2020, we were constrained in exploring more profound topics. If given the opportunity to run this project again, I would have liked to delve deeper into various datasets and uncover additional correlations that could potentially impact a country's happiness. Conducting thorough research on other subjects in diverse settings such as business, ecology, or healthcare will undoubtedly improve the quality of our report. We also could have established an entity-relationship structure to join datasets. Doing this opens the opportunity to answer more profound questions relating to global happiness.

5.3 Are there potential opportunities for future work?¶

  1. Team Member 1 - Bo Li:
    • To enhance our exploration of happiness scores, we aim to expand our dataset by identifying correlated variables such as crime rates, GDP, and food bank adoption rates. Additionally, our project could implement real-time visualizations to continuously monitor Key Performance Indicators, particularly those disseminated by the government, such as education-related metrics. Furthermore, collaborating and sharing our findings with peers and field experts can potentially provide us with invaluable and diverse feedback. Sharing our discoveries through presentations or publications will further contribute to the broader knowledge base, ensuring the distribution of valuable findings to a wider audience.
  1. Team Member 2 - Brian Ho:

    • For future work, there are several opportunities for additional exploration. Incorporating factors such as GDP per capita and life expectancy, in line with the official criteria from the World Happiness Report, could provide a more comprehensive understanding of happiness determinants. By analyzing these factors with concrete and objective data, we could discover deeper insights into their relationships with happiness. An intriguing avenue for future work could involve constructing a regression model that includes all identified influential factors. This model could be tested to determine the significance of each factor and assess how well they collectively explain happiness levels, contributing to valuable findings.
  1. Team Member 3 - Ethan Mah:
    • If we had more time to continue this project, I would be eager to delve into agriculture data, investigating the correlation between a country's resource production capacity and its overall happiness. Additionally, exploring how the level of imported goods influences a nation's happiness would be intriguing, questioning whether countries relying more on imports exhibit higher happiness scores. Furthermore, analyzing weather data can reveal crucial information about how the average rainfall or precipitation affects the happiness of a nation. This can lead to investigations into whether countries with higher rainfall are less happy or if there is no correlation between the two factors at all.

6. Conclusion¶

In reference to our guiding questions, we found a negative relationship between a country's happiness score and education inequality, in which as education inequality increased, the happiness score decreased. Our study also investigated population-related factors that influence happiness, finding that median age is the most reliable indicator to determine a person's happiness. In regards to geographical locations, countries in the northern part of the world display a tendency to be happier than their southern counterparts. In addition, more densely populated countries tend to possess greater education inequality scores. This statement is true except for developing countries with low human development index where urbanization has reduced education inequality.

Our analysis was meant to show, through breaking down survey information and large statistical datasets, how a country's happiness score can be influenced by factors beyond what was recorded in The World Happiness Report. Our specific datasets provided invaluable insights into the key demographic indicators and optimal geographical locations that have the greatest effect on one's happiness. While more work can still be done to create a more comprehensive analysis, our current findings on the subject can still prove to be useful for policymakers and individuals alike who strive for a happier life.


References¶

Belikow, A., DeWeese, J., Ravensbergen, L., Kestens, Y., & El-Geneidy, A. (2021). Are we happy in densely populated environments? assessing the impacts of density on subjective well-being, quality of life, and perceived health in Montreal, Canada. Findings. https://doi.org/10.32866/001c.23718

Bloon, D., & Khanna, T. (2007, September). The urban revolution. Retrieved from https://www.imf.org/external/pubs/ft/fandd/2007/09/bloom.htm

Carlos. (n.d.). Population density and educational inequality: the role of public school choice and accountability. Retrieved from https://scholar.harvard.edu/files/clastraanadon/files/lastra_density18.pdf

Cuñado, J., & Gracia, F. (2011, June 03). Does education affect happiness? Retrieved from https://link.springer.com/article/10.1007/s11205-011-9874-x#citeas

Fang, A. (2019, December 21). Chinese education system versus U.S education system. Retrieved from https://blog.tutorabcchinese.com/chinese-culture/chinese-education-system-vs-us-education-system

Garcia, E., & Weiss, E. (2017, September 27). Education inequalities at the school starting gate. Retrieved from https://www.epi.org/publication/education-inequalities-at-the-school-starting-gate/

Lenzi, C., & Perucca, G. (2022). No place for poor men: On the asymmetric effect of urbanization on Life Satisfaction. Social Indicators Research, 164(1), 165–187. https://doi.org/10.1007/s11205-022-02946-1

Li, N. P., & Kanazawa, S. (2016). Country roads, take me home… to My Friends: How Intelligence, population density, and friendship affect modern happiness. British Journal of Psychology, 107(4), 675–697. https://doi.org/10.1111/bjop.12181

Margolis, R., & Myrskylä, M. (2011). A global perspective on happiness and fertility. Population and Development Review, 37(1), 29–56. https://doi.org/10.1111/j.1728-4457.2011.00389.x

Mendizabal, A. (2018, December). Socioeconomic inequalities and academic achievement in Spain. Retrieved from https://elobservatoriosocial.fundacionlacaixa.org/en/-/desigualdades-socioeconomicas-y-rendimiento-academico

Nikolaev, B. (2016, January 26). Does other people’s education make us less happy? Retrieved from https://www.sciencedirect.com/science/article/pii/S0272775715300194#preview-section-cited-by

Saunders, D. (2016, April 22). Finland’s social climbers: how they’re fighting inequality with education, and winning. Retrieved from https://www.theglobeandmail.com/news/national/education/how-finland-is-fighting-inequality-with-education-andwinning/article29716845/

Teahcing about educational inequality. (n.d.). Retrieved from https://educators4sc.org/teaching-about-educational-inequality/

United Nations. (n.d.). Inequality - bridging the divide. Retrieved from https://www.un.org/en/un75/inequality-bridging-divide

Wiher, A. (2020). Education inequality in different countries. Retrieved from https://wp-prd.let.ethz.ch/WP0-CIPRF91243/chapter/education-inequality-in-different-countries-2/

Wu, X., & Zhang, Z. (2010, May 11). Changes in educational inequality in China. Retrieved from https://www.emerald.com/insight/content/doi/10.1108/S1479-3539%282010%290000017007/full/html ‌